Using genomic scars to select immunotherapy beneficiaries in advanced non-small cell lung cancer

In advanced non-small cell lung cancer (NSCLC), response to immunotherapy is difficult to predict from pre-treatment information. Given the toxicity of immunotherapy and its financial burden on the healthcare system, we set out to identify patients for whom treatment is effective. To this end, we used mutational signatures derived from DNA mutations in pre-treatment tissue. Single base substitution, doublet base substitution, indel, and copy number alteration signatures were analysed in $m=101$ patients (the discovery set). We found that the tobacco smoking signature (SBS4) and the thiopurine chemotherapy exposure-associated signature (SBS87) were linked to durable benefit. Combining both signatures in a machine learning model separated patients with a progression-free survival hazard ratio of $0.40_{-0.17}^{+0.28}$ on the cross-validated discovery set and $0.24_{-0.14}^{+0.31}$ on an independent external validation set ($m=56$).
This paper demonstrates that the fingerprints of mutagenesis, codified through mutational signatures, select advanced NSCLC patients who may benefit from immunotherapy, thus potentially reducing unnecessary patient burden.

A.2 Upstream RNA processing
FASTQ files were processed by pooling reads from different lanes per sample and subsequently trimming with Trimmomatic [5], using a four-base sliding window to cut reads falling below a quality of 25, dropping reads shorter than 50 bases, and trimming the 8 leading bases (the head) if below the quality threshold (and otherwise default parameters, as suggested in the manual). The trimmed sequencing data were analysed using FastQC and aligned with STAR [6], using the thirteenth release of the GRCh38 human reference assembly and Gencode's gene annotation, version 36, to guide read mapping. After alignment, MultiQC [7] was used to compare overall sample quality, allowing us to exclude two anomalous samples of inferior quality. Two additional RNA samples were excluded from downstream analysis because the patients had insufficient follow-up to unambiguously assign a durable benefit label (leaving a total of $m = 36$). The remaining aligned samples (in coordinate-sorted BAM format) were indexed using samtools [8]. The somatic DNA mutations were tracked at the RNA level by generating pile-ups at the (remapped) genomic position to obtain the variant allele frequency, $\omega_{\text{mut}}$ (un-remappable variants were discarded). RNA counts were estimated per exon (as suggested in the manual) using HTSeq-count [9] with the aforementioned gene annotation. To estimate the amount of mutant RNA we first computed the number of transcripts per million, $t_\alpha$, of transcript $\alpha$. Briefly, using the RNA counts per transcript, $n^{(\alpha)}$, and the transcript length, $l^{(\alpha)}$ (in base pairs), its value is defined by

$$t_\alpha = \frac{10^{6}}{Z}\,\frac{n^{(\alpha)}}{l^{(\alpha)}} \tag{1}$$

with $Z$ a normalisation constant, $Z = \sum_{\alpha} n^{(\alpha)} / l^{(\alpha)}$.
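The transcripts-per-million normalisation described above can be sketched in a few lines of Python. This is a minimal illustration, not the pipeline used in the study: the function name, the dictionary-based input, and the toy transcript names are assumptions made for the example.

```python
def transcripts_per_million(counts, lengths):
    """Return the TPM value of every transcript.

    counts  -- dict mapping transcript name to its RNA count n
    lengths -- dict mapping transcript name to its length l in base pairs
    """
    # Length-normalised count n / l for every transcript.
    rates = {name: counts[name] / lengths[name] for name in counts}
    # Normalisation constant Z: the sum of all length-normalised counts.
    z = sum(rates.values())
    # Scale so that the values of one sample sum to one million.
    return {name: 1e6 * rate / z for name, rate in rates.items()}


# Hypothetical toy input: two transcripts of equal length.
toy_counts = {"tx_a": 10, "tx_b": 30}
toy_lengths = {"tx_a": 1000, "tx_b": 1000}
toy_tpm = transcripts_per_million(toy_counts, toy_lengths)
```

Because the values are normalised within a sample, they always sum to one million; the sketch does not guard against a sample in which every count is zero.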

A.3 Mutated RNA re-estimation
Observe that the amount of mutated RNA molecules depends on the sample's overall tumor purity, $f_t$, as well as the tendency of tumor cells to express and/or break down the mutated transcripts. We therefore introduce a (transcript-specific) enrichment factor, $r_\alpha$, defined as the ratio between the allele frequency $\omega^{(\alpha)}_{\text{mut}}$ in transcript $\alpha$ and the overall tumor purity, $f_t$, i.e., $r_\alpha = \omega^{(\alpha)}_{\text{mut}} / f_t$. This enrichment factor re-estimates the RNA from the samples as if they had 100% tumor purity. When a transcript contained more than one mutation, the allele frequencies were, for simplicity, averaged. Finally, the estimated amount of mutant RNA molecules is given by Eq. (2).
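A minimal sketch of the re-estimation follows, under a stated assumption: Eq. (2) is read here as the product of the enrichment factor and the transcripts per million of the same transcript. The text describes the enrichment factor and the averaging of allele frequencies, but the product form, the function names, and the example numbers are illustrative assumptions.

```python
def enrichment_factor(allele_frequencies, tumor_purity):
    """Ratio of the (mean) variant allele frequency to the tumor purity."""
    # Several mutations in one transcript: average their allele frequencies.
    mean_frequency = sum(allele_frequencies) / len(allele_frequencies)
    return mean_frequency / tumor_purity


def mutant_rna(tpm, allele_frequencies, tumor_purity):
    """Assumed form of Eq. (2): enrichment factor times transcripts per million."""
    return enrichment_factor(allele_frequencies, tumor_purity) * tpm


# Hypothetical transcript with two mutations in a sample of 60% tumor purity.
example_amount = mutant_rna(tpm=100.0, allele_frequencies=[0.2, 0.4], tumor_purity=0.6)
```

With these toy numbers the mean allele frequency is 0.3, so the enrichment factor is about 0.5 and roughly half of the transcript's expression is attributed to the mutant allele.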

A.4 RNA per signature
For each mutational signature, $i$, dominant mutations, $j$, were determined by selecting the smallest set of mutations, $T$, that accounts for ≥ 50% of somatic mutations (i.e., $\sum_{j \in T} H_{ij} \geq 0.5$). Using these dominant mutations, the RNA expression of transcripts containing mutations in set $T$ [Eq. (2)] was pooled for analysis.
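One way to obtain the smallest set $T$ is to rank the mutation types of a signature by weight and accumulate them until the threshold is reached. The sketch below assumes that a row of $H$ is available as a dictionary from mutation type to weight; the function name and the toy weights are hypothetical, and the weights are renormalised in case the row does not sum exactly to one.

```python
def dominant_mutations(signature_weights, threshold=0.5):
    """Smallest set of mutation types whose weights sum to at least `threshold`.

    signature_weights -- dict mapping mutation type j to its weight H_ij
    """
    total = sum(signature_weights.values())
    # Taking the largest weights first yields the smallest possible set.
    ranked = sorted(signature_weights, key=signature_weights.get, reverse=True)
    selected, cumulative = [], 0.0
    for mutation in ranked:
        if cumulative >= threshold * total:
            break
        selected.append(mutation)
        cumulative += signature_weights[mutation]
    return selected


# Hypothetical signature over four mutation types.
toy_signature = {"C>A": 0.4, "C>T": 0.3, "T>C": 0.2, "T>G": 0.1}
toy_dominant = dominant_mutations(toy_signature)
```

For the toy signature, the largest weight alone (0.4) falls short of 50%, so the second-largest type is added and the set contains two mutation types.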

A.5 Net benefit
For the net-benefit analysis we use the following definitions from [10] and [11]:

$$\mathrm{NB}_{\text{treated}}(t) = \frac{\mathrm{TP}}{n} - \frac{\mathrm{FP}}{n}\,\frac{t}{1-t}, \qquad \mathrm{NB}_{\text{untreated}}(t) = \frac{\mathrm{TN}}{n} - \frac{\mathrm{FN}}{n}\,\frac{1-t}{t},$$

where $t$ is the probability threshold for accepting a positive prediction, $n$ is the total number of patients, and TP, FP, TN, FN are the number of true positives, false positives, true negatives, and false negatives, respectively. Here a low $t$ indicates that we attach great value to avoiding false negatives, and a high $t$ that we attach great value to avoiding false positives. The combined net benefit is defined as their sum, $\mathrm{NB}_{\text{combined}}(t) = \mathrm{NB}_{\text{treated}}(t) + \mathrm{NB}_{\text{untreated}}(t)$. A metric that directly follows is the integrated net benefit from [12]. A downside of the net benefit is its lack of interpretability; it also does not transparently include quality-adjusted life years or financial cost. To augment the net benefit we also consider cost as a function $C(t)$ of true positives, true negatives, false positives, and false negatives, assuming independent costs:

$$C(t) = \omega_{\mathrm{TP}}\,\mathrm{TP} + \omega_{\mathrm{TN}}\,\mathrm{TN} + \omega_{\mathrm{FP}}\,\mathrm{FP} + \omega_{\mathrm{FN}}\,\mathrm{FN},$$

where $\omega$ represents the weights. Here the weights are taken as $\omega_{\mathrm{TN}} = 0$, $\omega_{\mathrm{TP}} = 100$, $\omega_{\mathrm{FN}} = 200$, $\omega_{\mathrm{FP}} = 200$. That is, we attach a cost of 0 to true negatives, as these incur no extra costs; a cost of 100 to an accurate estimation that immunotherapy is necessary, i.e., to an effective immunotherapy; a cost of 200 to a false rejection of immunotherapy, to account for incurring unnecessary suffering; and again a cost of 200 to an ineffective immunotherapy, the reasoning for the latter being that we incur the cost of the immunotherapy plus the cost of treating the side-effects. The attribution of weights should be done with extreme care, and we emphasize that we merely present this as an addition to the classical net benefit analysis.
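The quantities in this section can be computed directly from predicted probabilities and observed labels. The sketch below assumes the standard decision-curve forms of the net benefit (true positives minus false positives weighted by the odds t/(1-t), and true negatives minus false negatives weighted by the inverse odds, both divided by the number of patients) and a linear weighted cost. The function names, the rule "treat when the predicted probability is at least t", and the toy data are assumptions made for the example.

```python
def confusion_counts(labels, probabilities, t):
    """Count TP, FP, TN, FN when patients with probability >= t are treated."""
    tp = fp = tn = fn = 0
    for label, probability in zip(labels, probabilities):
        predicted = probability >= t  # assumption: threshold is inclusive
        if predicted and label:
            tp += 1
        elif predicted and not label:
            fp += 1
        elif label:
            fn += 1
        else:
            tn += 1
    return tp, fp, tn, fn


def net_benefit(tp, fp, tn, fn, t):
    """Return (treated, untreated, combined) net benefit for 0 < t < 1."""
    n = tp + fp + tn + fn
    treated = tp / n - fp / n * t / (1 - t)
    untreated = tn / n - fn / n * (1 - t) / t
    return treated, untreated, treated + untreated


def weighted_cost(tp, fp, tn, fn, weights):
    """Linear cost with one caller-supplied weight per outcome type."""
    return (weights["TP"] * tp + weights["FP"] * fp
            + weights["TN"] * tn + weights["FN"] * fn)


# Hypothetical toy cohort: 1 marks durable benefit, 0 marks no durable benefit.
toy_labels = [1, 1, 0, 0]
toy_probabilities = [0.9, 0.4, 0.6, 0.1]
toy_counts = confusion_counts(toy_labels, toy_probabilities, t=0.2)
toy_net_benefit = net_benefit(*toy_counts, t=0.2)
```

Evaluating these functions over a grid of thresholds gives the net benefit and cost curves as functions of t; the weights are passed in explicitly so that different cost assumptions can be compared.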

B.1 Correlation between signatures and prior treatment
The majority of patients in the discovery cohort had a history of chemotherapy and/or radiotherapy. Both radiation exposure [13,14,15] and chemotherapy [13] are known to cause distinguishing mutations. We therefore analysed all signatures (SBS, DBS, indel, and CNV) to see whether prior treatment led to any significant signature differences between the exposed and unexposed groups. For both chemotherapy and radiotherapy, no difference between the groups was observed. Our negative findings can be explained by the fact that (i) lung cancer is, after melanoma, one of the cancers with the highest mutation burden [16], (ii) signatures partially overlap, (iii) our population is small, and (iv) our false discovery control is strict. Treatment-related mutations could be buried by other mutation sources generating similar mutations (cf. the overlap of the platinum chemotherapy signature SBS35 and the smoking-associated signature SBS4). The resulting mutational signature differences may therefore be too small to detect in our population, given our multiple testing correction.

In the validation dataset (right), there was no significant difference in the ROC AUC ($0.69_{-0.14}^{+0.14}$ versus $0.78_{-0.14}^{+0.12}$, respectively, p = 0.18 PPT). Estimates and corresponding 95% CIs are indicated by sub- and superscripts.